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Abstract. We present herein an extension of an algebraic statistical 
method for inferring biochemical reaction networks from experimental 
data, proposed recently in [3]. This extension allows us to analyze reac- 
tion networks that are not necessarily full-dimensional, i.e., the dimen- 
sion of their stoichiometric space is smaller than the number of species. 
Specifically, we propose to augment the original algebraic-statistical al- 
gorithm for network inference with a pre-processing step that identifies 
the subspace spanned by the correct reaction vectors, within the space 
spanned by the species. This dimension reduction step is based on prin- 
cipal component analysis of the input data and its relationship with 
various subspaces generated by sets of candidate reaction vectors. Sim- 
ulated examples are provided to illustrate the main ideas involved in 
implementing this method, and to asses its performance. 
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1 Introduction 

Modern biological research often involves collecting detailed information 
on time-dependent chemical concentration data for complex networks or 
pathways of biochemical interactions [5, 13]. Often, the main purpose of 
collecting such data is to extract information on the structure of a net- 
work of biochemical reactions for which only the identity of the chemical 
species present in the network is known, with little or no information on 
the structure of the network of interactions between these species [14]. 
This problem is of particular interest in the context of molecular and 
systems biology, and has received a lot of attention in the literature for 
a long time [1, 8-10, 12, 17-20]. 

Two different reaction networks might generate identical (deterministic) 
mass-action models, making it impossible to discriminate among them, 



even if we are given experimental data of perfect accuracy and unlimited 
temporal resolution (this lack of uniqueness is sometimes referred to 
as the "fundamental dogma of chemical kinetics" [5-7]). Necessary and 
sufficient conditions for two reaction networks to give rise to the same 
deterministic mass action model are described in [4], where the problem 
of identifiability of mass action reaction network models was analyzed in 
detail. The key observation is that, if we think of reactions as vectors, 
it is possible for different sets of such vectors to span the same positive 
cones, or at least to span positive cones that have nonempty intersection 
(see Figure 1 in [4] for a simple example). 

Often, the experimental measurements for the study of a specific reac- 
tion network or pathway are being collected under many different exper- 
imental conditions, which affect the values of reaction rate parameters. 
Moreover, the reactions of interest may not be "elementary reactions" , 
for which the reaction rates parameters must be constant, but so-called 
"overall reactions" which summarize several elementary reaction steps. 
In that case the reaction rates parameters may reflect the concentrations 
of biochemical species which have not been included explicitly in the 
model. Then the reaction rate parameters are not constant, but rather 
depend on specific experimental conditions, such as concentrations of 
enzymes and other intermediate species. Therefore, the estimated vector 
of reaction rate parameters will not be the same for all experimental 
conditions, but each specific experimental setting will give rise to one 
such vector of parameters. However, the set of all these vectors should 
span a specific cone, whose extreme rays should identify exactly the set 
of reactions that gave rise to the data [4] . 

Based on these geometric considerations, we proposed a statistical method 
[3] which allows us to take advantage of the inherent stochasticity in the 
data, in order to determine the unique reaction network that can best 
account for the results of all the available experiments pooled together. 
This idea is related to the notion of algebraic statistical model (see Chap- 
ter 1 in [15]), and relies on mapping the estimated reaction parameters 
into an appropriate convex region of the span of reaction vectors of a 
network, and using the underlying geometry to identify the reactions 
which are most likely to span that region. This approach reduces the 
network identification problem to a statistical inference problem for the 
parameters of a multinomial distribution, which may then be solved us- 
ing classical likelihood methods. The description of the computational 
approaches to analyzing the geometry of convex regions of interest in 
our setting may be found in [11]. 

In the method described in [3] we have made the assumption that the 
dimension of the span of the "true" reaction vectors is maximal, i.e., it 
equals the total number d of species; this assumption allowed us to relate 
(i) the likelihood of a d-dimensional convex cone spanned by a particular 
set of d reaction vectors with (ii) the number of experimental data points 
that lie in this cone. 

In this paper we extend the applicability of the maximum likelihood 
method to reaction networks that are not necessarily full-dimensional in 



Fig. 1. Simulated data points for the network {Ao — ► 2^1,^4.0 — * 2A2} were obtained 
by using the procedure described in Section 3. Gaussian noise of mean zero and SD 
0.15 was also added. Note that the points remain close to the positive cone spanned 
by the two reaction vectors. This cone has dimension two, while the space spanned by 
the species has dimension three. 



the sense explained above. An example of such a network is 

A -> 2Ai, A ~*2A 2 , (1) 

where we have three species but the reaction vectors [—1, 0, 2] and 
[— 1, 2, 0] span a two dimensional cone. In Section 3 of the paper we ex- 
plain how we can represent the data as geometric coordinates of points in 
the interior of the cone generated by the true reaction vectors. A cloud of 
data points simulated from network (1) is depicted in Figure 1. Although 
it is altered by noise, the data set has a flat configuration parallel to the 
plane of 2Aq— A\~ A 2 = determined by the stoichiometric vectors. The 
new pre-processing algorithm introduced in this paper will recognize the 
true dimensionality of the data using an eigenvalue analysis approach, 
and will process the data so the maximum likelihood method in [3] can 
be applied. 



2 Maximum Likelihood Method for Inferring 
Biochemical Reaction Networks 



Given experimental data that originates from a large network of possible 
reactions, a systematic way of inferring the subset of reactions that are 
most likely to generate the data was introduced in [3]. The method uses 
a statistical model that parametrizes the probabilities that the given 
reactions are part of the correct subset. The three main assumptions for 
the model are 

1. the network is conical, i.e., all reactions have a common source, 

2. the true reactions generate linearly independent reaction vectors, 

3. the stoichiometric matrix has full rank, i.e., the linear span of the 
true reaction vectors is the same as the span of the species. 

Assumption 1 does not actually limit the scope of this method since, as 
shown in [4], the network inference can be done one source at a time. On 
the other hand, the full-dimensionality assumption 3 does not hold for 
large classes of networks, for example networks that obey one or more 
linear conservation laws, such as the network (1). In this paper we extend 
the applicability of this maximum likelihood method to these classes of 
reaction networks by introducing a dimension reduction pre-processing 
step. In future work we intend to analyze the case where assumption 2 
does not hold as well. 

The general setup of the maximum likelihood method in [3] is as fol- 
lows. We consider d species and a set of m possible reactions R = 
{Ri, . . . , Rm} involving these species, and we denote by TZd the set of 
all (™) positive cones spanned by subsets of d reactions in R. Further 
we denote by Cone(R) the positive cone generated by the reaction vec- 
tors in R and by {Si, ...,£„} the partition of Cone(R) into smallest 
full-dimensional sub-cones obtained by taking all possible intersections 
of cones in TZd . The full dimensional cones Si, . . . ,S n are referred to as 
building blocks. 

A parameter 9i, i = l,...m is defined as the conditional probability 
of the reaction Ri to be involved in generating the data, given that 
there are d true reactions. These parameters are used to define functions 
Pj(6i, . . . ,9 m ), j = 1, • • • , n which represent the probability that a data 
point is observed in the building block Sj, conditioned on the existence 
of exactly d true reactions. The construction of the Pj 's uses appropriate 
volume measurements to asses the probability that a point lies inside a 
certain cone. In doing so, the method disregards the zero volume (de- 
generate) cones and it is therefore not appropriate for data generated 
from a network for which the true reactions do not span a full dimen- 
sional cone. See [3, Section 2] for detailed explanation of the geometric 
methods in the construction of pj . A likelihood function is constructed 
using the functions pj and also involves counting the data points within 
each building block Si, . . . ,S n - We then infer the true reactions from the 
values of 8i that maximize the likelihood function. 

The algorithm was implemented using Matlab [2] and has been made 
publicly available at http://www.math.wisc.edu/~pantea. To illustrate 



the implementation, consider the set of reactions in (2), and data gen- 
erated from all these reactions except for Ao — > A 2 , using independent 
rate constants drawn from a Gamma(1.5, 1) distribution (see also the 
discussion in [3, Example 7]). 

Ai 2A 3 A 2 



(2) 



Ai +A 2 A 2 + A 3 

Once the likelihood function is constructed, the algorithm runs an op- 
timization procedure for the maximum of the likelihood, repeated 2 m ~ 1 
times (e.g., 16 times for network (2)), with restarts from random ini- 
tial values of 6. A list of (local) minima is created, and the algorithm 
reports the value of 9 that realized the smallest local minimum, to- 
gether with the percentage of time the algorithm ends up at that par- 
ticular point (i.e., the success rate). An example of such a report for 
R = {A -> 2A 3 , A -> Aj,Ao -> Ai +A 2 ,A -> A 2 + A 3 ,A -> A 2 } is 

Minimum of negative log-likelihood: 6.94 
Theta: 

11110. 
Hits: 16 out of 16, 100 1 /,. 

This shows that the algorithm selects the correct four reactions as the 
ones generating the data by assigning them probability 1, and discards 
the last (incorrect) reaction Ao — > A 2 by assigning it probability 0. 




3 Numerical Example with Simulated Input 
Data 

We describe our approach for dimension reduction by analyzing in detail 
a modified version of reaction network (2). Let Ai, i = 0, 1, 2, 3, denote 
four chemical species and consider the following augmented network of 
possible reactions: 



Ai 2A A A 2 




Ai + A 2 2A 2 A 2 + A 3 
where we suppose that the true reactions that generate the data are 

A -> Ai + A 2 , A -> A 2 + A 3 , and A -» 2A 3 . (4) 



In this setting we would like to test whether our algorithm would cor- 
rectly identify the three reactions (4) out of the network (3) as being 
responsible for generating the data. 

Note that unlike [3, Example 8] , where the set of true reactions was taken 
to be 

A -> A 1 , A -» Ai + A 2 , A Q -> A 2 + A 3 , and A -» 2A S , (5) 

the set of true reactions (4) spans a three-dimensional (instead of four- 
dimensional) space. Since the original maximum likelihood algorithm as 
proposed in [3] required full-dimensionality of the set of true reactions, it 
cannot be applied directly in our case. A pre-processing step is required 
to determine the correct dimensionality of the data. 

Data generation. We use the true reactions (4) above to simulate "ex- 
perimentally measured" data and to test whether our dimension reduc- 
tion procedure, followed by the likelihood-based algorithm from [3], is 
able to identify the true reactions from the the possible six reactions of 
network (3). 

The deterministic dynamics of the chemical reaction network (3) is gov- 
erned by linear differential equations of the form 

dA t /dt = lt A i = 0,...,3. (6) 

where the parameters 7,, i = ... 3 are linear combinations of the rate 
constants, given by the stoichiometry of the true network. 
To obtain the simulated measurements of the concentrations of Ao , A\, A2 
and A3 we independently draw the three rate parameters of the true reac- 
tions from a gamma distribution Gamma(a, A) with parameters a = 1.5 
and A = 1 and use a standard Matlab ODE solver to generate trajec- 
tories of the resulting mass-action system. We then fit the simulated 
trajectories to system (6) using the Matlab function lsqcurvef it and 
obtain a single resulting data point K = (71,72,73,74). This procedure 
is repeated to generate 20 data points. 

The data points Ki, i = 1, ... 20, can be viewed geometrically as points 
in the coordinate system given by the species [Ao, A\, A2, A3]. This in- 
terpretation is not related to any choice of the reactions. However, if we 
disregard the fitting error, each data point lies inside the three dimen- 
sional open convex cone generated by the true reaction vectors. More- 
over, as explained in [4] , the corresponding conical coordinates of Ki are 
precisely the estimates of the true rate constants. 

Dimension reduction. Although our data is given as four dimensional 
points, it is in fact generated by a reaction network whose stoichiometry 
is three dimensional. Therefore the data points are confined to a three- 
dimensional subspace of the canonical species space [Ao, Ai, A2, A3], or 
are very close to a three-dimensional subspace, if one accounts for the 
error that occurs in the estimation. 

We introduce a new first step to our algorithm, which consists of recov- 
ering the true dimensionality of the data. To this end we use a standard 



eigenvalue method, namely we look for small eigenvalues of the DD' ma- 
trix, where D is the 4 x 20 matrix of data points. Here "small" eigenvalues 
correspond in fact to zero eigenvalues, altered by estimation error. The 
corresponding eigenvectors impose linear constraints on the data, and 
consequently on the true reaction vectors. We can therefore discard the 
reaction vectors that do not satisfy the aforementioned constraints. 
For the network (3) with the true reactions (4) the first output of the 
algorithm is 

10 2 
10 2 
10 11 
1110 



The algorithm has correctly concluded that the data points should belong 
to the hyperplane 2A a +A\ + A2+A3 — and that the true reactions are 
among the four whose reaction vectors are included in this hyperplane, 
namely 

A Q -> 2A 2 , A Q -> 2A 3 , A -» A 2 + A 3 , A -> Ai + A 2 . 

The dimension reduction is completed by projecting our data onto the 
true stoichiometric subspace above. The algorithm then computes an 
orthonormal basis of this subspace and rewrites the data in the new 
coordinates. The resulting data matrix is of dimension 3 x 20 and the 
maximum likelihood algorithm in [3] can now be applied and produces 
the following output: 

Minimum of negative log-likelihood: 
Theta: 

111 
Hits: 8 out of 8, 1007. . 

which refers to the reaction vectors labeled "possible" and correctly con- 
cludes that the true reactions are the ones corresponding to the last three 
reaction vectors from the "possible" list. 

Note that the new coordinates of the data points projected onto the 
subspace of the true stoichiometry do not have an immediate geometric 
interpretation, as was the case for the the four coordinates (70, 71, 72, 73). 
However, the geometric arguments that make the likelihood method work 
can clearly be applied in the projected configuration. By choosing an or- 
thonormal basis for the true stoichiometric subspace, we are ensuring 
that angles and lengths are consistent between the two coordinate sys- 
tems. 

Threshold for eigenvalues. Our dimension reduction step is depen- 
dent upon isolating small eigenvalues, which would be zero had it not 
been for the estimating error. Therefore we need a threshold value to 
separate the eigenvalues that are small because they correspond to zero 
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Fig. 2. Density plot for the empirical distribution of the smallest nonzero absolute 
value A among the eigenvalues of DD' . The reactions that generate D are {Aq — > 
2A 2 ,A -> Ai + A 2 .} 



eigenvalues perturbed by error. To obtain this threshold, we use a Monte 
Carlo simulation to construct the empirical distribution of the smallest 
nonzero absolute value of an eigenvalue of DD' . The data matrix D has 
as many points as the real data (20 in our case) and is generated from 
all possible 2 6 — 1 = 63 choices of the true reactions out of the possible 
6 reactions in network (3) , without any noise added. For each such com- 
bination of reactions a sample of 1000 D matrices was generated using 
independent Gamma(1.5, 1) distributed rate constants. An illustrative 
histogram of the typical resulting empirical density of the smallest eigen- 
value in absolute value is given in Figure 3. This is just one of the 63 
cases, the case where the true reactions are taken to be Ao — > 2A 2 and 
A Q -> A l + A 2 . 

We merge all the 63 frequencies to obtain a unique distribution for the 
network of possible reactions (3) and we use a standard quantile of this 
distribution to define the threshold value. In example (3), the 1% quantile 
gives a value of the threshold of 0.7. 

More numerical results. One could further test the algorithm by 
choosing the set of true reaction vectors set to be even lower dimen- 
sional, e.g., of dimension two (for example ^lo — > 2A 2 , Aq — > 2A3, Ao — > 
A 2 + A 3 ) or one (for example, Aq — > Ai). For the particular network 
considered in the tests we conducted, both of these dimension reduc- 
tions were performed correctly by our algorithm pre-processing step and 
moreover the subsequent maximum likelihood step always had 100% suc- 
cess rate. This is perhaps not surprising because for small dimensional 
problems of this type the maximum likelihood algorithm is expected to 
behave very well. 



The effect of noise. In [3] the effect of the estimating error on the ef- 
fectiveness of the maximum likelihood algorithm was studied using sim- 
ulations. Namely, zero-mean Gaussian noise with varying variance was 
added to a set of data points Ki, i = 1 ... 20 corresponding to the reac- 
tion network (3) and true reactions (5). This was repeated in batches of 
3000 for increasing values of the standard deviation of the added noise 
and the maximum likelihood algorithm for network discovery was per- 
formed. The result of this experiment is presented in Figure 3. As seen 
from the figure with small to moderate random noise added to the values 
of the reaction constants the likelihood method is still able to recover the 
correct set of reactions at remarkably high rate. 




0.0 0.5 1.0 1.5 2.0 
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Fig. 3. The rate of recovery of the correct reactions (5) out of the network (3) as a func- 
tion of the size of the Gaussian noise added to the estimated parameters [71 , 72 , 73 , 74] • 
The recovery rate is over 90% for the Gaussian noise with 0.5 SD and about 50% for 
1 SD. 



We take a similar approach to investigate how the dimension reduction 
step is affected by noise. Just as before, we add zero-mean normal noise 
to the matrix D of data points in batches of 500 for 50 increasing values 
of the noise standard deviation, from to .5. We consider three qual- 
itatively different cases for the true network that generated the data, 
according to the dimension of the corresponding stoichiometry. More 
precisely we chose the one-dimensional reaction network A — > 2A 2 , the 
two dimensional network A Q — > 2A2, Ao — > 2 A3, Aq — > A2 + A3 and the 
three dimensional network Aq — > 2 A3, Aq — > A2 + A3, Ao — > Ai + A2- 
The recovery rate of the true underlying stoichiometric space is presented 



in Figure 3 and has a decreasing shape similar to the one of Figure 3. 
Note, however, that the dimension recovery rate descends much faster 
to zero, being much more sensible to noise. In other words, even if the 
noise does not move the data points from the inside to the outside of 
the full dimensional cones generated by the reaction vectors, it can be 
sufficient to "thicken" the distribution of the data points making it look 
full dimensional and thus misleading the algorithm into choosing a full 
dimensional stoichiometry. 
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Fig. 4. The rate of recovery of the correct stoichiometric space for the augmented 
network (3) as a function of the size of the Gaussian noise added in batches of 500 to 
the 20 data points. The three different curves correspond to different dimensions of the 
true data. 



4 Summary and Discussion 

A likelihood-based algebraic statistical model for inferring biochemical 
reaction networks was proposed recently in [3] , in order to overcome the 
unidcntifiability problem for chemical networks given by a set of reaction 
rate equations. Indeed, as illustrated in the earlier work of some of the 
authors [4] , in the usual deterministic sense such networks are in general 
unidentifiable, i.e., different chemical reaction networks may give rise 
to exactly the same reaction rate equations. The attractiveness of the 
algebraic statistical method was in its ability to take advantage of the 
algebraic and geometric structure of the network rather than merely the 
observed experimental values of the network species, as is commonly 



the case in network inference models based on graphical methods, like 
e.g. Bayesian or probabilistic boolean networks (cf. [16]). The original 
algebraic model described in [3] made the simplifying assumption that 
the stoichiometric subspace of the true network was fully-dimensional, 
i.e., its dimension equaled the number of species. In the current work we 
remove this restriction via a pre-processing algorithm that identifies the 
true dimensionality of the data using an eigenvalue analysis approach; 
then, once we identified the correct network span, we use a restricted 
version of the same algebraic statistical model to identify the correct 
network. We present a small study based on simulated data, where the 
true dimension of a network was 1-3 units lower than the number of 
species, and where our dimension reduction method worked as intended. 
Overall, in the examples provided as well as in other similar numerical 
experiments not reported here, the algorithm with pre-processing for 
dimension reduction performed very well, being able to recover the true 
network structure almost perfectly as long as the level of noise present 
in the data was not too large. 

More challenging studies are needed in order to determine the applica- 
bility of this method for large biochemical networks and for true exper- 
imental data. Also, in future work, we intend to address the case where 
the assumption that the reaction vectors are linearly independent does 
not necessarily hold true. 
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